Kaggle Competition Spaceship Titanic

Data Description

https://www.kaggle.com/competitions/spaceship-titanic/overview

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

File and Data Field Descriptions

train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
Destination - The planet the passenger will be debarking to.
Age - The age of the passenger.
VIP - Whether the passenger has paid for special VIP service during the voyage.
RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
Name - The first and last names of the passenger.
Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

sample_submission.csv - A submission file in the correct format.
PassengerId - Id for each passenger in the test set.
Transported - The target. For each passenger, predict either True or False.

Import Libraries

Load Data

Data Overview & NULL Values

Data Visualization

Create New Features

Following the description of the original dataset, split the information in the columns 'Cabin', 'PassengerId', and 'Name' into several columns

--> Cabin: 'CabinDeck', 'CabinNum', 'CabinSide'
--> PassengerId: 'GroupId', 'NumInGroup'
--> Name: 'FirstName', 'LastName'
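
The split described above can be sketched with pandas string methods. This is a minimal illustration on a few made-up rows, not the notebook's original code; the new column names follow the list above, and the conversion of the numeric parts to numbers is an assumption.

```python
import pandas as pd

# Toy rows in the format given by the data description (values are illustrative).
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S", None],
    "Name": ["Maham Ofracculy", "Juanna Vines", None],
})

# Cabin has the form deck/num/side; a missing cabin stays missing in all three parts.
df[["CabinDeck", "CabinNum", "CabinSide"]] = df["Cabin"].str.split("/", expand=True)
df["CabinNum"] = pd.to_numeric(df["CabinNum"])  # assumption: treat the number as numeric

# PassengerId has the form gggg_pp: group number and position within the group.
df[["GroupId", "NumInGroup"]] = (
    df["PassengerId"].str.split("_", expand=True).astype(int)
)

# Name holds first and last name; split only at the first space.
df[["FirstName", "LastName"]] = df["Name"].str.split(" ", n=1, expand=True)

print(df[["CabinDeck", "CabinNum", "CabinSide", "GroupId", "NumInGroup", "LastName"]])
```

Keeping 'CabinNum' as a float here is only a consequence of the missing value; whether to impute before converting to an integer is left open.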

Create further new columns based on the existing data

--> 'GroupSize' (number of people in each group)
--> 'TotalSpend' (total amount the passenger spent on board the ship)
--> 'HomePlanet' + 'Destination' = 'Route'
--> 'IsSingle' (is the person travelling alone or in a group?)
--> 'NoSpend' (the passenger has no spending on board the ship)
--> 'IsChild' (is the passenger a minor?)
--> 'namesakes_num_in_group' (number of namesakes in the group)
--> 'NameLength' (number of characters in the passenger's name)
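
One possible way to derive these columns is sketched below on made-up rows. Several details are assumptions rather than the notebook's confirmed choices: the " -> " separator in 'Route', the age threshold of 18 for 'IsChild', and the reading of 'namesakes_num_in_group' as the number of group members (including the passenger) who share a last name.

```python
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy data; 'GroupId' and 'LastName' are assumed to exist from the earlier split.
df = pd.DataFrame({
    "GroupId": [1, 2, 2, 3],
    "Name": ["Maham Ofracculy", "Juanna Vines", "Altark Vines", "Solam Susent"],
    "LastName": ["Ofracculy", "Vines", "Vines", "Susent"],
    "HomePlanet": ["Europa", "Earth", "Earth", "Mars"],
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e", "TRAPPIST-1e"],
    "Age": [39.0, 24.0, 8.0, 16.0],
    "RoomService": [0.0, 109.0, 0.0, 0.0],
    "FoodCourt": [0.0, 9.0, 0.0, 0.0],
    "ShoppingMall": [0.0, 25.0, 0.0, 0.0],
    "Spa": [0.0, 549.0, 0.0, 0.0],
    "VRDeck": [0.0, 44.0, 0.0, 0.0],
})

# Number of passengers sharing the same group number.
df["GroupSize"] = df.groupby("GroupId")["GroupId"].transform("count")
df["IsSingle"] = df["GroupSize"] == 1

# Sum over all amenities; sum() skips missing values, so NaN counts as 0 here.
df["TotalSpend"] = df[spend_cols].sum(axis=1)
df["NoSpend"] = df["TotalSpend"] == 0

# Assumption: join origin and destination with " -> ".
df["Route"] = df["HomePlanet"] + " -> " + df["Destination"]

# Assumption: a minor is anyone younger than 18.
df["IsChild"] = df["Age"] < 18

# Assumption: count group members with the same last name, the passenger included.
df["namesakes_num_in_group"] = (
    df.groupby(["GroupId", "LastName"])["LastName"].transform("count")
)

df["NameLength"] = df["Name"].str.len()
```

Note that a missing 'Age' compares as False in `df["Age"] < 18`, so rows without an age would silently count as adults unless 'Age' is imputed first.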

Target Variable Distribution

The two classes, Transported and Not Transported, are roughly equal in size: 50.4% and 49.6%.

Visualization of Categorical Features

Findings:
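
A class balance like this can be checked with `value_counts(normalize=True)`. The sketch below uses a small made-up frame named `train` (a hypothetical name), so its percentages are not the competition's figures.

```python
import pandas as pd

# Made-up target column; the real values come from train.csv.
train = pd.DataFrame(
    {"Transported": [True, False, True, True, False, False, True, False]}
)

# Share of each class in percent, rounded to one decimal place.
share = train["Transported"].value_counts(normalize=True).mul(100).round(1)
print(share)
```

With classes this close to 50/50, plain accuracy is a reasonable metric and no resampling is needed.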

--------- Plot 1 'HomePlanet' --------

--------- Plot 2 'Destination' --------

Visualization of Numerical Features

-- There are no VIP passengers on deck G; all passengers on deck G are from planet Earth

Data Preparation

HomePlanet

--> People in the same group always depart from the same 'HomePlanet'
--> Passengers with the same 'LastName' depart from the same planet.
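
These two observations suggest filling a missing 'HomePlanet' from other passengers with the same group number, and then from passengers with the same last name. The helper `fill_from_key` below is a hypothetical name introduced for illustration, shown on made-up rows; it is a sketch of the idea rather than the notebook's original code.

```python
import pandas as pd

df = pd.DataFrame({
    "GroupId": [1, 1, 2, 3, 4],
    "LastName": ["Vines", "Vines", "Susent", "Susent", "Brantuarez"],
    "HomePlanet": ["Earth", None, "Mars", None, None],
})


def fill_from_key(frame, key, target):
    """Fill missing `target` values using rows that share the same `key` value."""
    # Build a key -> target lookup from rows where both values are known.
    known = (
        frame.dropna(subset=[key, target])
        .drop_duplicates(subset=[key])
        .set_index(key)[target]
    )
    # Only gaps are filled; existing values are left untouched.
    return frame[target].fillna(frame[key].map(known))


# First use the group, then fall back to the last name.
df["HomePlanet"] = fill_from_key(df, "GroupId", "HomePlanet")
df["HomePlanet"] = fill_from_key(df, "LastName", "HomePlanet")
print(df)
```

Rows with no match in either lookup (the last row here) stay missing and need a fallback such as the most frequent planet.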

Destination

--> No one-to-one dependency was found between group number or last name and destination, so missing values are replaced with the most frequent destination
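
Replacing missing values with the most frequent one can be sketched with `Series.mode()`. The rows below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Destination": ["TRAPPIST-1e", "55 Cancri e", None, "TRAPPIST-1e", None]
})

# mode() ignores missing values; take the first entry in case of ties.
most_common = df["Destination"].mode()[0]
df["Destination"] = df["Destination"].fillna(most_common)
print(df["Destination"].value_counts())
```

When the same step is applied to the test set, it is usually safer to reuse the value computed on the training data than to recompute it.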

LastName

CabinDeck

CabinNum

CabinSide

VIP

CryoSleep

Age

Expenses ("RoomService", "FoodCourt", "ShoppingMall", "Spa", 'VRDeck')

Memory Optimization

Correlation Matrix

Test Data Preparation

ML Models
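
A common way to reduce a DataFrame's memory footprint is to downcast numeric columns and store low-cardinality text columns as categories. The sketch below shows that general technique on made-up columns; it is an assumption about what this step involves, not the notebook's original code.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 33.0],
    "GroupSize": [1, 2, 2, 1],
    "HomePlanet": ["Europa", "Earth", "Earth", "Mars"],
})

before = df.memory_usage(deep=True).sum()

# Downcast to the smallest numeric type that can hold the values.
df["Age"] = pd.to_numeric(df["Age"], downcast="float")
df["GroupSize"] = pd.to_numeric(df["GroupSize"], downcast="integer")

# Few distinct values: store as category codes instead of repeated strings.
df["HomePlanet"] = df["HomePlanet"].astype("category")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"bytes before: {before}, after: {after}")
```

The savings only become noticeable on larger frames; on a handful of rows the fixed overhead of the category type can outweigh them.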

Logistic Regression

DecisionTreeClassifier

RandomForestClassifier

GradientBoostingClassifier

PCA

SVM

KNeighborsClassifier

XGBClassifier

CatBoostClassifier

LGBMClassifier

ROC PLOT

Model Ensemble

Final Forecast